Skip to content

Add per-phase Prometheus analysis and consistent phase logging to loa…#179

Merged
akrzos merged 1 commit into
redhat-performance:mainfrom
akrzos:phased-prometheus-analysis-and-logging
Jun 23, 2026
Merged

Add per-phase Prometheus analysis and consistent phase logging to loa…#179
akrzos merged 1 commit into
redhat-performance:mainfrom
akrzos:phased-prometheus-analysis-and-logging

Conversation

@akrzos

@akrzos akrzos commented Jun 18, 2026

Copy link
Copy Markdown
Member

Extract launch_prometheus_analysis() into shared utils/analysis.py module
used by both acm-deploy-load.py and acm-telco-core-load.py. Add phased
workload structure (Idle Baseline, Cluster Deployment, Soak Baseline) with
automatic background Prometheus analysis at phase boundaries for resource
consumption comparison testing. Align logging, report generation, and section
naming across both load tools.

  • Add -k/--kubeconfig and --no-prometheus-analysis CLI args to acm-deploy-load.py
  • Run 3 background analyses: idle-baseline, cluster-deployment, soak-baseline
  • Add periodic countdown logging during idle and soak baseline phases
  • Consolidate workload parameter logging into structured block matching telco-core pattern
  • Move telco-core inline report to generate_telco_core_report() in utils/output.py
  • Rename generate_report() to generate_deploy_load_report() for clarity
  • Rename "Deployed Cluster Orchestration" to "Workload Parameters" in report
  • Align report section names: Workload Parameters, Workload Duration Results, Workload Phases
  • Enrich deploy-load report Workload Parameters with phase durations, wait configs, and cluster details

@akrzos akrzos force-pushed the phased-prometheus-analysis-and-logging branch 26 times, most recently from 6823578 to 9f5568d Compare June 22, 2026 18:20
@akrzos akrzos marked this pull request as ready for review June 22, 2026 18:33
@akrzos akrzos force-pushed the phased-prometheus-analysis-and-logging branch 2 times, most recently from 9f5568d to 0987ed7 Compare June 23, 2026 14:04
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@akrzos, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 51 minutes and 5 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3f6c07bd-99ec-4dfe-b26a-bf835a32afc1

📥 Commits

Reviewing files that changed from the base of the PR and between ccb26fc and c9fa893.

📒 Files selected for processing (8)
  • README.md
  • acm-deploy-load/acm-deploy-load.py
  • acm-deploy-load/acm-telco-core-load.py
  • acm-deploy-load/utils/analysis.py
  • acm-deploy-load/utils/common_ocp.py
  • acm-deploy-load/utils/output.py
  • acm-deploy-load/utils/ztp_monitor.py
  • scripts/interval-ztp-install-all.sh
📝 Walkthrough

Walkthrough

The PR introduces a shared utils/analysis.py with launch_prometheus_analysis and kubeconfig validation, refactors utils/output.py to replace generate_report with generate_deploy_load_report and add generate_telco_core_load_report, adds kubeconfig propagation to ZTPMonitor, and rewrites both acm-deploy-load.py and acm-telco-core-load.py to execute an explicit three-phase workflow (idle baseline → cluster deployment → soak baseline) with per-phase Prometheus analysis invocations.

Changes

Three-phase ACM workload orchestration with Prometheus analysis

Layer / File(s) Summary
Shared Prometheus analysis utility
acm-deploy-load/utils/analysis.py
New module implements launch_prometheus_analysis, which checks for the analyzer script, enforces a 15-minute minimum window, builds a subprocess command with kubeconfig and phase name, and redirects combined output to a per-phase log file.
Kubeconfig validation helper
acm-deploy-load/utils/common_ocp.py
New function validate_kubeconfig executes oc whoami using the provided kubeconfig via the command() wrapper, exits on failure, and logs the authenticated identity on success.
ZTPMonitor kubeconfig support
acm-deploy-load/utils/ztp_monitor.py
Constructor gains a kubeconfig parameter stored as self.kubeconfig; all oc get calls in _real_run are updated to pass --kubeconfig self.kubeconfig for every monitored resource type.
Report generation utilities
acm-deploy-load/utils/output.py
generate_report is replaced by generate_deploy_load_report (adds soak_start_time, per-phase idle/deployment/soak duration fields, and Workload Parameters/Phases report sections); generate_telco_core_load_report is added with mode-branching report content for telco workloads.
acm-deploy-load.py CLI and initialization
acm-deploy-load/acm-deploy-load.py
Adds --kubeconfig and --no-prometheus-analysis flags; updates --method default; clarifies --start-delay/--end-delay help text to explicitly refer to Phase 1 idle baseline and Phase 3 soak baseline; imports updated to use launch_prometheus_analysis and generate_deploy_load_report; AAP detection is kubeconfig-aware; kubeconfig validation is invoked after parsing; report paths computed early.
acm-deploy-load.py Phase 1 idle baseline orchestration
acm-deploy-load/acm-deploy-load.py
Workload Parameters logging is refactored to include Phase 1/Phase 3 descriptions and conditional Prometheus analysis messaging; Phase 1 performs chunked sleep based on start_delay, sets deploy_start_time after completion, and conditionally runs launch_prometheus_analysis with the phase1-idle-baseline window.
acm-deploy-load.py Phase 2/3 deployment and soak orchestration
acm-deploy-load/acm-deploy-load.py
Non-GitOps oc apply command includes --kubeconfig cliargs.kubeconfig; Phase 2 is followed by conditional launch_prometheus_analysis with phase2-cluster-deployment window; Phase 3 soak baseline performs chunked sleep based on end_delay and conditionally runs launch_prometheus_analysis with phase3-soak-baseline window; final report calls generate_deploy_load_report with expanded phase-timing argument list.
acm-telco-core-load.py refactor to shared utilities
acm-deploy-load/acm-telco-core-load.py
Removes local launch_prometheus_analysis function and inline report.txt writing; imports both from utils.*; renames Phase 1 to "Idle Baseline" and Phase 3 to "Soak Baseline" with updated Prometheus phase window names; rewrites Workload Parameters logging with mode-specific branches (Deploy+Policy/Deploy-only vs Policy-only); introduces deploy_batch_count for timing computations; delegates final report to generate_telco_core_load_report.
Shell script and documentation
scripts/interval-ztp-install-all.sh, README.md
Script switches method to ibi-clusterinstance-gitops, adds start_delay, end_delay, and prometheus_analysis_arg variables, and passes them to acm-deploy-load.py. README replaces the prior deploy/wait/report bullet list with the three-phase workflow description and --no-prometheus-analysis note.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding per-phase Prometheus analysis and consistent phase logging to the load testing tools.
Description check ✅ Passed The description is directly related to the changeset, providing detailed context about extracting shared utilities, implementing phased workloads, and aligning reporting across both tools.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@acm-deploy-load/acm-telco-core-load.py`:
- Around line 298-302: The idle baseline and soak baseline Prometheus analysis
launches are gated on `cliargs.no_deploy == False` at lines 298 and 429
respectively, but the Policy-only mode logging at lines 253-254 indicates
Prometheus analysis should run. To make baseline analysis mode-independent as
suggested, remove the `cliargs.no_deploy == False` condition from both the
launch_prometheus_analysis calls for the idle baseline and soak baseline windows
so they execute regardless of deployment mode. Keep the Phase 2 batch analysis
behind the deploy gate since it depends on actual deployment occurring.

In `@acm-deploy-load/utils/analysis.py`:
- Around line 38-50: The command list construction and the logger.info call that
logs the command using " ".join(cmd) are currently placed before the try block.
Move both the cmd list definition and the logger.info statement inside the try
block so that if kubeconfig is unset or non-string, any resulting error will be
properly caught by the exception handler instead of bypassing the controlled
error handling path.

In `@acm-deploy-load/utils/output.py`:
- Around line 228-285: The code checks three conditional branches for different
combinations of no_deploy and no_policy flags but does not handle the case where
both cliargs.no_deploy and cliargs.no_policy are True. This causes phase2_label
to remain unassigned in that scenario, leading to an UnboundLocalError when
phase2_label is referenced in the log_write call at line 285. Add an additional
elif or else branch after the three existing conditions to explicitly handle the
case where both flags are True, and assign an appropriate value to phase2_label
in that branch.

In `@scripts/interval-ztp-install-all.sh`:
- Around line 12-33: The wrapper in interval-ztp-install-all.sh is not passing a
single kubeconfig value through both deploy and Prometheus analysis flows, so
the two steps can target different clusters. Update the script’s main
phase-argument wiring and the deploy/analyzer invocations to accept and reuse
the same kubeconfig variable consistently, using the existing phase option
handling around the wrapper logic and analyzer call sites.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7f27caf6-ea5b-4e32-afb4-ca8c16ac926c

📥 Commits

Reviewing files that changed from the base of the PR and between c6a1e4c and 0987ed7.

📒 Files selected for processing (7)
  • README.md
  • acm-deploy-load/acm-deploy-load.py
  • acm-deploy-load/acm-telco-core-load.py
  • acm-deploy-load/utils/analysis.py
  • acm-deploy-load/utils/output.py
  • acm-deploy-load/utils/ztp_monitor.py
  • scripts/interval-ztp-install-all.sh

Comment thread acm-deploy-load/acm-telco-core-load.py
Comment thread acm-deploy-load/utils/analysis.py
Comment thread acm-deploy-load/utils/output.py
Comment thread scripts/interval-ztp-install-all.sh
@akrzos akrzos force-pushed the phased-prometheus-analysis-and-logging branch from 0987ed7 to ccb26fc Compare June 23, 2026 15:03
@akrzos

akrzos commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@akrzos

akrzos commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
acm-deploy-load/utils/output.py (1)

228-285: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add the missing no_deploy=True + no_policy=True branch before using phase2_label.

phase2_label is uninitialized in the both-true mode and then used at Line 285, causing UnboundLocalError during report generation.

Suggested fix
     elif cliargs.no_deploy == True and cliargs.no_policy == False:
       phase2_label = "Policy Updates"
       log_write(report, " * Mode: Policy configmap updates only")
       log_write(report, " * Phase 1 / Idle Baseline (Start delay): {}s :: {}".format(
           cliargs.start_delay, str(timedelta(seconds=cliargs.start_delay))))
       log_write(report, " * Phase 2 ({}):".format(phase2_label))
       log_write(report, "  * Update policy configmap ({} keys) in namespace {} per {}s interval".format(
           cliargs.hub_policy_cm_keys, cliargs.hub_policy_namespace, cliargs.interval_policy))
       log_write(report, "   * Maximum number of policy intervals to run: {}".format(cliargs.max_policy_intervals))
       log_write(report, " * Phase 3 / Soak Baseline (End delay): {}s :: {}".format(
           cliargs.end_delay, str(timedelta(seconds=cliargs.end_delay))))
+    else:
+      phase2_label = "No active workload"
+      log_write(report, " * Mode: Idle baseline + soak baseline only")
+      log_write(report, " * Phase 1 / Idle Baseline (Start delay): {}s :: {}".format(
+          cliargs.start_delay, str(timedelta(seconds=cliargs.start_delay))))
+      log_write(report, " * Phase 2 ({}): no deploy/policy actions".format(phase2_label))
+      log_write(report, " * Phase 3 / Soak Baseline (End delay): {}s :: {}".format(
+          cliargs.end_delay, str(timedelta(seconds=cliargs.end_delay))))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@acm-deploy-load/utils/output.py` around lines 228 - 285, The variable
phase2_label is initialized in three conditional branches checking combinations
of cliargs.no_deploy and cliargs.no_policy, but the fourth case where both are
True is missing, causing an UnboundLocalError when phase2_label is used later at
line 285. Add an elif branch for the case where cliargs.no_deploy == True and
cliargs.no_policy == True, and initialize phase2_label with an appropriate
descriptive label (such as "No Operations" or similar) for this scenario to
ensure the variable is always defined before being used in the subsequent
log_write calls.
acm-deploy-load/utils/analysis.py (1)

38-50: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Move command logging/build into guarded path and validate kubeconfig before join.

At Line 48, " ".join(cmd) can throw if kubeconfig is not a string, and that happens before the try, so the function can crash instead of hitting the controlled warning path.

Suggested fix
-  cmd = [
-    sys.executable,
-    analyzer_script,
-    "-k", kubeconfig,
-    "-s", start_str,
-    "-e", end_str,
-    "-b", "0",
-    "-p", phase_name,
-    report_dir,
-  ]
-  logger.info("Prometheus analysis command: {}".format(" ".join(cmd)))
   log_file = os.path.join(report_dir, "pa-{}.log".format(phase_name))
   try:
+    if not isinstance(kubeconfig, str) or not kubeconfig:
+      logger.warning("Skipping prometheus analysis phase {}: kubeconfig is not set".format(phase_name))
+      return
+    cmd = [
+      sys.executable,
+      analyzer_script,
+      "-k", kubeconfig,
+      "-s", start_str,
+      "-e", end_str,
+      "-b", "0",
+      "-p", phase_name,
+      report_dir,
+    ]
+    logger.info("Prometheus analysis command: {}".format(" ".join(cmd)))
     with open(log_file, "w") as f:
       proc = subprocess.Popen(
         cmd,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@acm-deploy-load/utils/analysis.py` around lines 38 - 50, The command logging
at line 48 using the join operation happens before the try block, which means if
kubeconfig is not a string, the join will fail and crash the function instead of
being handled by the exception handler. Move the logger.info call that logs the
command and the log_file assignment inside the try block so they are guarded by
exception handling. Additionally, validate that kubeconfig is a string before
constructing the cmd list to prevent the join operation from failing due to
invalid kubeconfig type.
🧹 Nitpick comments (1)
acm-deploy-load/acm-deploy-load.py (1)

338-345: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Validate kubeconfig before running AAP detection.

Running detect_aap_install(...) before validate_kubeconfig(...) can log a misleading “AAP install not detected” when kubeconfig is invalid, then fail immediately after. Validate first, then run detection.

Suggested reorder
-  # Detect AAP install
-  if detect_aap_install(cliargs.kubeconfig, cliargs.dry_run):
-    logger.info("AAP install detected, waiting for playbook completion")
-    cliargs.wait_playbook = True
-  else:
-    logger.info("AAP install not detected")
-
-  # Validate parameters
-  validate_kubeconfig(cliargs.kubeconfig)
+  # Validate parameters
+  validate_kubeconfig(cliargs.kubeconfig)
+
+  # Detect AAP install
+  if detect_aap_install(cliargs.kubeconfig, cliargs.dry_run):
+    logger.info("AAP install detected, waiting for playbook completion")
+    cliargs.wait_playbook = True
+  else:
+    logger.info("AAP install not detected")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@acm-deploy-load/acm-deploy-load.py` around lines 338 - 345, Reorder the
validation logic so that validate_kubeconfig is called before the
detect_aap_install check. Move the validate_kubeconfig(cliargs.kubeconfig) call
to execute first, before the if block that calls detect_aap_install. This
ensures the kubeconfig is validated as valid before attempting AAP detection,
preventing misleading "AAP install not detected" log messages when the
kubeconfig is actually invalid.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@acm-deploy-load/utils/analysis.py`:
- Around line 38-50: The command logging at line 48 using the join operation
happens before the try block, which means if kubeconfig is not a string, the
join will fail and crash the function instead of being handled by the exception
handler. Move the logger.info call that logs the command and the log_file
assignment inside the try block so they are guarded by exception handling.
Additionally, validate that kubeconfig is a string before constructing the cmd
list to prevent the join operation from failing due to invalid kubeconfig type.

In `@acm-deploy-load/utils/output.py`:
- Around line 228-285: The variable phase2_label is initialized in three
conditional branches checking combinations of cliargs.no_deploy and
cliargs.no_policy, but the fourth case where both are True is missing, causing
an UnboundLocalError when phase2_label is used later at line 285. Add an elif
branch for the case where cliargs.no_deploy == True and cliargs.no_policy ==
True, and initialize phase2_label with an appropriate descriptive label (such as
"No Operations" or similar) for this scenario to ensure the variable is always
defined before being used in the subsequent log_write calls.

---

Nitpick comments:
In `@acm-deploy-load/acm-deploy-load.py`:
- Around line 338-345: Reorder the validation logic so that validate_kubeconfig
is called before the detect_aap_install check. Move the
validate_kubeconfig(cliargs.kubeconfig) call to execute first, before the if
block that calls detect_aap_install. This ensures the kubeconfig is validated as
valid before attempting AAP detection, preventing misleading "AAP install not
detected" log messages when the kubeconfig is actually invalid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 38857c1c-70cd-4ab7-b00a-efb24f3ec9ba

📥 Commits

Reviewing files that changed from the base of the PR and between c6a1e4c and ccb26fc.

📒 Files selected for processing (8)
  • README.md
  • acm-deploy-load/acm-deploy-load.py
  • acm-deploy-load/acm-telco-core-load.py
  • acm-deploy-load/utils/analysis.py
  • acm-deploy-load/utils/common_ocp.py
  • acm-deploy-load/utils/output.py
  • acm-deploy-load/utils/ztp_monitor.py
  • scripts/interval-ztp-install-all.sh

… load tools

Extract launch_prometheus_analysis() into shared utils/analysis.py module
used by both acm-deploy-load.py and acm-telco-core-load.py. Add phased
workload structure (Idle Baseline, Cluster Deployment, Soak Baseline) with
automatic background Prometheus analysis at phase boundaries for resource
consumption comparison testing. Align logging, report generation, and section
naming across both load tools.

- Add -k/--kubeconfig and --no-prometheus-analysis CLI args to acm-deploy-load.py
- Run 3 background analyses: idle-baseline, cluster-deployment, soak-baseline
- Add periodic countdown logging during idle and soak baseline phases
- Consolidate workload parameter logging into structured block matching telco-core pattern
- Move telco-core inline report to generate_telco_core_report() in utils/output.py
- Rename generate_report() to generate_deploy_load_report() for clarity
- Rename "Deployed Cluster Orchestration" to "Workload Parameters" in report
- Align report section names: Workload Parameters, Workload Duration Results, Workload Phases
- Enrich deploy-load report Workload Parameters with phase durations, wait configs, and cluster details

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@akrzos akrzos force-pushed the phased-prometheus-analysis-and-logging branch from ccb26fc to c9fa893 Compare June 23, 2026 19:03
@akrzos akrzos merged commit cd37df2 into redhat-performance:main Jun 23, 2026
1 check passed
@akrzos akrzos deleted the phased-prometheus-analysis-and-logging branch June 23, 2026 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant